Abstract

We propose in this paper a tracking algorithm which is able to
adapt itself to different scene contexts. A feature pool is used
to compute the matching score between two detected objects.
This feature pool includes 2D and 3D displacement distances, 2D
sizes, color histogram, histogram of oriented gradients (HOG),
color covariance and dominant color. An offline learning process
is proposed to search for useful features and to estimate
their weights for each context. In the online tracking process,
a temporal window is defined to establish the links between
the detected objects. This makes it possible to find the object
trajectories even if the objects are misdetected in some frames.
A trajectory filter is proposed to remove noisy trajectories.
Experimentation on different contexts is shown. The proposed
tracker has been tested on videos belonging to three public
datasets and to the Caretaker European project. The experimental
results show the effect of the proposed feature weight learning
and the robustness of the proposed tracker compared to some
state-of-the-art methods. The contributions of our approach over
state-of-the-art trackers are: (i) a robust tracking algorithm
based on a feature pool, (ii) a supervised learning scheme to
learn feature weights for each context, (iii) a new method to
quantify the reliability of the HOG descriptor, (iv) a combination
of color covariance and dominant color features with spatial
pyramid distance to manage the case of object occlusion.

1 Introduction 

Many approaches have been proposed to track mobile objects 
in a scene [?]. The problem is to have tracking algorithms 
which perform well in different scene conditions (e.g. differ- 
ent people density levels, different illumination conditions) and 
to be able to tune their parameters. The idea of automatic
control for adapting an algorithm to context variations has
already been studied [?, ?, ?]. In [?], the authors have presented
a framework which integrates knowledge and uses it to
control image processing programs. However, the construction 
of a knowledge base requires a lot of time and data. Their study 
is restricted to static image processing (no video). In [?], the 
author has presented an architecture for a self-adaptive percep- 
tual system in which the "auto-criticism" stage plays the role of 
an online evaluation process. To do that, the system computes a
trajectory goodness score based on clusters of typical trajectories.
Therefore, this method can only be applied to scenes
where mobile objects move on well-defined paths. In [?], the
authors have presented a tracking framework which is able to 
control a set of different trackers to get the best possible per- 
formance. The approach is interesting but the authors do not 
describe how to evaluate online the tracking quality and the ex- 
ecution of three trackers in parallel is very expensive in terms 
of processing time. 

In order to overcome these limitations, we propose a track- 
ing algorithm that is able to adapt itself to different contexts. 
The notion of context used in this paper covers a set of
scene properties: density of mobile objects, frequency of occlusion
occurrences, illumination intensity, contrast level and
the depth of the scene. These properties have a strong effect on
the tracking quality. In order to track object movements
in different contexts, we first define a feature pool in
which each weighted feature combination can help the system
to improve its performance in a given context. However, the
parameter configuration of these features (i.e. determination of 
feature weight values) is a hard task because the user has to 
quantify correctly the importance of each feature in the con- 
sidered context. To facilitate this task, we propose an offline 
learning algorithm based on Adaboost [?] to compute feature 
weight values for each context. In this work, we have two as- 
sumptions. First, each video has a stable context. Second, for 
each context, there exists a training video set. 

The paper is organized as follows: The next section 
presents the feature pool and explains how to use it to com- 
pute link similarity between the detected objects. Section 3 de- 
scribes the offline learning process to tune the feature weights 
for each scene context. Section 4 shows in detail the different 
stages of the tracking process. The results of the experimenta- 
tion and validation can be found in section 5. A conclusion as 
well as future work are given in the last section. 



2 Feature pool and link similarity 
2.1 Feature pool 

The principle of the proposed tracking algorithm is based on 
the coherence of mobile object features throughout time. In 
this paper, we define a set of 8 different features to compute
a link similarity between two mobile objects l and m within a
temporal window (see figure 1).



2.1.1 2D and 3D displacement distance similarity 

Depending on the object type (e.g. car, bicycle, walker), the
object speed cannot exceed a fixed threshold. Let D_max be the
maximal possible 3D displacement of a mobile object for one
frame in a video and d be the 3D distance between two considered
objects. We define a similarity LS_1 between these two objects
using the 3D displacement distance feature as follows:

    LS_1 = \max(0, 1 - d / (D_{max} \cdot n))    (1)

where n is the temporal difference (in frames) between the two
considered objects.

Similarly, we also define a similarity LS_2 between two objects
using the displacement distance feature in the 2D image
coordinate system.
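
As an illustration, equation (1) can be sketched in a few lines of Python. The function name and the argument checks are assumptions made for this example and are not part of the original method description; the same function serves for LS_2 when d and D_max are expressed in image coordinates.

```python
def displacement_similarity(d, d_max, n):
    """Equation (1): similarity from a displacement d observed over n frames.

    d      -- distance between the two detections (3D for LS_1, 2D for LS_2)
    d_max  -- maximal possible displacement per frame (same unit as d)
    n      -- temporal difference between the two detections, in frames
    """
    if d_max <= 0 or n <= 0:
        raise ValueError("d_max and n must be positive")
    # The similarity decreases linearly and is clipped at 0 when the
    # displacement exceeds what the object could cover in n frames.
    return max(0.0, 1.0 - d / (d_max * n))
```

For example, a displacement of 3 units over 3 frames with D_max = 2 units per frame uses half of the allowed range, so the similarity is 0.5.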

2.1.2 2D shape ratio and area similarity 

Let W_l and H_l be the width and height of the 2D bounding
box of object l. The 2D shape ratio and area of this object
are respectively defined as W_l / H_l and W_l H_l. If no occlusions
occur and mobile objects are well detected, the shape ratio and area
of a mobile object within a temporal window do not vary
much even if the lighting and contrast conditions are not good.
A similarity LS_3 between the 2D shape ratios of objects l and
m is defined as follows:

    LS_3 = \min(W_l / H_l, W_m / H_m) / \max(W_l / H_l, W_m / H_m)    (2)

Similarly, we also define the similarity LS_4 between the
2D areas of objects l and m as follows:

    LS_4 = \min(W_l H_l, W_m H_m) / \max(W_l H_l, W_m H_m)    (3)
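
Equations (2) and (3) share the same min/max form, which the following sketch makes explicit. The helper name `ratio_similarity` is an assumption introduced for this example only.

```python
def ratio_similarity(a, b):
    """min(a, b) / max(a, b) for two strictly positive quantities."""
    if a <= 0 or b <= 0:
        raise ValueError("both quantities must be positive")
    return min(a, b) / max(a, b)


def shape_ratio_similarity(w_l, h_l, w_m, h_m):
    """Equation (2): similarity of the width/height ratios of two boxes."""
    return ratio_similarity(w_l / h_l, w_m / h_m)


def area_similarity(w_l, h_l, w_m, h_m):
    """Equation (3): similarity of the areas of two bounding boxes."""
    return ratio_similarity(w_l * h_l, w_m * h_m)
```

Two boxes of sizes 2 x 4 and 1 x 2 have the same shape ratio (similarity 1) but areas 8 and 2 (similarity 0.25), which shows that the two features carry different information.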

2.1.3 Color histogram similarity 

In this work, the color histogram of a mobile object is defined
as a normalized RGB color histogram of moving pixels inside
its bounding box. We define a link similarity LS_5 between two
objects l and m for the color histogram feature as follows:

    LS_5 = \sum_{k=1}^{K} \min(H_l(k), H_m(k))    (4)

where K is a parameter representing the number of histogram
bins for each color channel (K = 1..256), and H_l(k) and H_m(k)
are respectively the histogram values of objects l and m at bin k.
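
The histogram comparison can be sketched as a histogram intersection. This is a minimal example that assumes a single normalized histogram per object (one channel, or the three channels concatenated and renormalized); the function names are hypothetical.

```python
def normalize(hist):
    """Scale a histogram of raw counts so that its bins sum to 1."""
    total = sum(hist)
    if total == 0:
        raise ValueError("empty histogram")
    return [value / total for value in hist]


def histogram_similarity(h_l, h_m):
    """Histogram intersection of two normalized histograms, in [0, 1]."""
    if len(h_l) != len(h_m):
        raise ValueError("histograms must have the same number of bins")
    return sum(min(a, b) for a, b in zip(h_l, h_m))
```

Two histograms that share half of their mass, such as `normalize([2, 2, 0, 0])` and `normalize([0, 2, 2, 0])`, give a similarity of 0.5.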

2.1.4 HOG similarity

In case of occlusion, the system may fail to detect the full
appearance of mobile objects. The above features are then unreliable.
In order to address this issue, we propose to use the HOG
descriptor to locally track interest points on mobile objects and
to compute the trajectory of these points. The HOG similarity
between two objects is defined as a value proportional to the
number of pairs of tracked points belonging to both objects. In
[?], the authors propose a method to track FAST points based
on their HOG descriptors. However, the authors do not compute
the reliability level of the obtained point trajectories. In
this work, we define a method to quantify the reliability of the
trajectory of each interest point by considering the coherence of
the Frame-to-Frame (F2F) distance, the direction and the HOG
similarity of the points belonging to the same trajectory. We
assume that the variation of these features follows a Gaussian
distribution.

Let (p_1, p_2, ..., p_i) be the trajectory of a point. Point p_i is
on the current tracked object and point p_{i-1} is on an object
previously detected. We define a coherence score S_i^{dist} of the F2F
distance of point p_i as follows:

    S_i^{dist} = \frac{1}{\sqrt{2 \pi \sigma_i^2}} e^{-\frac{(d_i - \mu_i)^2}{2 \sigma_i^2}}    (5)

where d_i is the 2D distance between p_i and p_{i-1}, and \mu_i and \sigma_i
are respectively the mean and standard deviation of the F2F
distance distribution formed by the set of points (p_1, p_2, ..., p_i).

In the same way, we compute the direction coherence score
S_i^{dir} and the similarity coherence score S_i^{desc} of each interest
point. Finally, for each interest point p_i on the tracked object l,
we define a coherence score S_i^l as the mean value of these three
coherence scores.

Let P be the set of interest point pairs whose trajectories
pass through two considered objects o_l and o_m; let S_i^l (S_j^m
respectively) be the coherence score of point i (j respectively)
on object l (m respectively) belonging to set P. We define the
HOG similarity between these two objects as follows:

    LS_6 = \min\left( \frac{\sum_{i=1}^{|P|} S_i^l}{M_l}, \frac{\sum_{j=1}^{|P|} S_j^m}{M_m} \right)    (6)

where M_l and M_m are the total numbers of interest points detected
on objects l and m.
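
The two steps above (a Gaussian coherence score per point, then an aggregation per object pair) can be sketched as follows. This is only an illustration: the function names are hypothetical, the population standard deviation is used, and the handling of a zero standard deviation is an assumption, since the text does not cover that case.

```python
import math
from statistics import mean, pstdev


def gaussian_coherence(value, samples):
    """Gaussian density of `value` under the mean and standard deviation
    of `samples` (e.g. the F2F distances along one point trajectory)."""
    mu = mean(samples)
    sigma = pstdev(samples)
    if sigma == 0:
        # Assumption: a constant history is fully coherent only with itself.
        return 1.0 if value == mu else 0.0
    exponent = -((value - mu) ** 2) / (2 * sigma ** 2)
    return math.exp(exponent) / math.sqrt(2 * math.pi * sigma ** 2)


def hog_similarity(scores_l, scores_m, total_l, total_m):
    """Aggregate the coherence scores of the shared points on each object,
    normalized by the total number of interest points on that object."""
    return min(sum(scores_l) / total_l, sum(scores_m) / total_m)
```

With two shared points out of four detected points on each object, `hog_similarity([0.5, 0.5], [1.0, 1.0], 4, 4)` evaluates to 0.25: the object with the less coherent points bounds the similarity.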

2.1.5 Color covariance similarity 

Color covariance is a very useful feature to characterize the
appearance model of an image region. In particular, the
color covariance matrix makes it possible to compare regions of
different sizes and is invariant to identical shifting of color
values. This becomes an advantageous property when objects
are tracked under varying illumination conditions. In
[?], for a point i in a given image region R, the authors
define a covariance matrix C_i corresponding to 11 descriptors,
where (x, y) is the pixel location, R_xy, G_xy and B_xy are the RGB
channel values, and M, O correspond to the gradient magnitude
and orientation in each channel at position (x, y).

We use the distance defined by [?] to compare two covariance
matrices:

    \rho(C_i, C_j) = \sqrt{\sum_{k=1}^{F} \ln^2 \lambda_k(C_i, C_j)}    (7)

where F is the number of considered image descriptors (F =
11 in this case) and \lambda_k(C_i, C_j) are the generalized eigenvalues
of C_i and C_j.



In order to take into account the spatial coherence of the
color covariance distance and also to manage occlusion cases,
we propose to use the spatial pyramid distance defined in [?].
The main idea is to divide the image region of a considered
object into a set of sub-regions. For each level i (i > 0), the
considered region is divided into a set of 2^i x 2^i sub-regions.
Then we compute the local color covariance distance for each
pair of corresponding sub-regions. The computation for each
sub-region pair helps to evaluate the spatial structure coherence
between two considered objects. In the case of occlusions, the
color covariance distance between two regions corresponding
to occluded parts is very high. Therefore, for each level we keep
only the half of the sub-region pairs with the lowest color
covariance distances (i.e. highest similarities) to compute the
final color covariance distance.

The similarity of this feature is defined as a function of the
spatial pyramid distance:

    LS_7 = \max(0, 1 - d_{cov} / D_{cov\_max})    (8)

where d_cov is the spatial pyramid distance of the color covariance
between two considered objects, and D_cov_max is the
maximum distance for two color covariance matrices to be considered
as similar.
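
The occlusion handling described above (keeping only the lower half of the sub-region distances at each level) can be sketched as follows. The text does not state how the retained distances or the levels are combined, so this example assumes a plain mean in both cases; the function names are hypothetical and the per-sub-region distances are taken as given.

```python
def level_distance(sub_distances):
    """Mean of the lower half of the sub-region distances at one level.

    Dropping the upper half discards the sub-regions most likely to be
    occluded. Assumption: the retained distances are simply averaged.
    """
    ordered = sorted(sub_distances)
    keep = max(1, len(ordered) // 2)
    return sum(ordered[:keep]) / keep


def pyramid_distance(levels):
    """Combine the per-level distances (assumption: unweighted mean)."""
    return sum(level_distance(level) for level in levels) / len(levels)


def covariance_similarity(d_cov, d_cov_max):
    """Equation (8): map a pyramid distance to a similarity in [0, 1]."""
    return max(0.0, 1.0 - d_cov / d_cov_max)
```

For a level with sub-region distances 4, 1, 3 and 2, only 1 and 2 are kept, so the level distance is 1.5 regardless of how large the two discarded (possibly occluded) distances are.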

2.1.6 Dominant color similarity 

The dominant color descriptor (DCD) has been proposed by
MPEG-7 and is extensively used for image retrieval [?]. This
is a reliable color feature because it takes into account only
the important colors of the considered image region. The DCD of an
image region is defined as F = {{c_i, p_i}, i = 1..A} where
A is the total number of dominant colors in the considered image
region, c_i is a 3D RGB color vector, and p_i is its occurrence
percentage, with \sum_{i=1}^{A} p_i = 1.

Let F_1 and F_2 be the DCDs of two image regions of considered
objects. The dominant color distance between these two
regions is defined using the similarity measure proposed in [?].
Also, as for the color covariance feature, in order to take
into account the spatial coherence and occlusion cases, we propose
to use the spatial pyramid distance for the dominant color
feature. The similarity of this feature is defined as a function
of the spatial pyramid distance as follows:

    LS_8 = 1 - d_{DC}    (9)

where d_DC is the spatial pyramid distance of dominant colors
between two considered objects.

2.2 Link similarity 

Using the eight features we have described above, a link similarity
LS(o_l, o_m) is defined as a weighted combination of the feature
similarities LS_k between objects o_l and o_m:

    LS(o_l, o_m) = \frac{\sum_{k=1}^{8} w_k LS_k}{\sum_{k=1}^{8} w_k}    (10)

where w_k is the weight of feature k (corresponding to its effectiveness);
at least one weight is not null.
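
A minimal sketch of the weighted combination, assuming that the link similarity is normalized by the sum of the weights (which is why at least one weight must be non-zero). The function name is hypothetical.

```python
def link_similarity(similarities, weights):
    """Weighted mean of the per-feature similarities LS_k.

    similarities -- per-feature similarity values, each in [0, 1]
    weights      -- per-feature weights w_k, same order and length
    """
    if len(similarities) != len(weights):
        raise ValueError("one weight is needed per feature")
    total = sum(weights)
    if total <= 0:
        raise ValueError("at least one weight must be non-zero")
    return sum(w * s for w, s in zip(weights, similarities)) / total
```

A feature with a zero weight has no influence on the result and its similarity does not need to be computed at all, which is what makes a learned sparse weight vector cheaper to evaluate.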



3 Learning feature weights 

Each feature described above is effective for some particular 
context conditions. However, how can the user quantify cor- 
rectly the feature significance for a given context? In order to 
address this issue, we propose in this paper an offline super- 
vised learning process using the Adaboost algorithm [?]. First 
a weak classifier is defined per feature. Then a strong classifier 
which combines these eight weak classifiers (corresponding to 
the eight features) with their weights is learnt. 

For each context, we select a learning video sequence representative
of this context. First, each object pair (o_l, o_m)
in two consecutive frames (called a training sample), denoted
op_i (i = 1..N), is classified into one of two classes {+1, -1}:
y_i = +1 if the pair belongs to the same tracked object and
y_i = -1 otherwise. For each feature k (k = 1..8), we define a
classification mechanism for a pair op_i as follows:

    h_k(op_i) = \begin{cases} +1 & \text{if } LS_k(o_l, o_m) \geq Th_1 \\ -1 & \text{otherwise} \end{cases}    (11)

where LS_k(o_l, o_m) is the similarity score of feature k (defined
in section 2.1) between two objects o_l and o_m, and Th_1 is a predefined
threshold representing the minimum feature similarity
for two objects to be considered as similar.

The loss function of the Adaboost algorithm at iteration z for
each feature k is defined as:

    \epsilon_k = \sum_{i=1}^{N} D_z(i) \max(0, -y_i h_k(op_i))    (12)

where D_z(i) is the weight of the training sample op_i at iteration
z. At each iteration z, the goal is to find the feature k whose loss
\epsilon_k is minimum; the corresponding h_k and \epsilon_k are
denoted h_z and \epsilon_z. The weight of this weak classifier, denoted
\alpha_z, is computed as follows:

    \alpha_z = \frac{1}{2} \ln \frac{1 - \epsilon_z}{\epsilon_z}    (13)

We then update the weights of the samples:

    D_{z+1}(i) = \begin{cases} 1/N & \text{if } z = 0 \\ \frac{D_z(i) \exp(-\alpha_z y_i h_z(op_i))}{A_z} & \text{otherwise} \end{cases}    (14)

where A_z is a normalization factor so that \sum_{i=1}^{N} D_{z+1}(i) = 1.

At the end of the Adaboost algorithm, the feature weights
are determined for the learning context and are used to compute
the link similarity defined in formula (10).
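
The learning loop can be sketched as a discrete AdaBoost over one thresholded weak classifier per feature. This is an illustrative sketch, not the authors' implementation: the function name, the clamping of the error, the early stop when the best error reaches 0.5, and above all the mapping from the accumulated classifier weights to normalized feature weights are assumptions, since the text does not detail them.

```python
import math


def learn_feature_weights(samples, labels, threshold, rounds):
    """samples[i][k] is LS_k for training pair i; labels[i] is +1 or -1."""
    n, n_features = len(samples), len(samples[0])
    # Weak classifier outputs: +1 when the feature similarity reaches the
    # threshold, -1 otherwise.
    h = [[1 if s[k] >= threshold else -1 for k in range(n_features)]
         for s in samples]
    d = [1.0 / n] * n                      # initial sample weights
    alphas = [0.0] * n_features            # accumulated weight per feature
    for _ in range(rounds):
        # Weighted error of each feature: total weight of misclassified pairs.
        errors = [sum(d[i] for i in range(n) if h[i][k] != labels[i])
                  for k in range(n_features)]
        best = min(range(n_features), key=errors.__getitem__)
        # Clamp to avoid a division by zero for a perfect classifier.
        eps = min(max(errors[best], 1e-10), 1 - 1e-10)
        if eps >= 0.5:                     # no feature beats chance
            break
        alpha = 0.5 * math.log((1 - eps) / eps)
        alphas[best] += alpha
        # Increase the weight of misclassified pairs, then renormalize.
        d = [d[i] * math.exp(-alpha * labels[i] * h[i][best])
             for i in range(n)]
        total = sum(d)
        d = [v / total for v in d]
    total_alpha = sum(alphas)
    if total_alpha == 0:
        return alphas
    return [a / total_alpha for a in alphas]
```

On a toy training set where the first feature separates the two classes perfectly and the second is uninformative, all the weight goes to the first feature, which mirrors the single-feature selections reported in the experiments.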

4 The proposed tracking algorithm 

The proposed tracking algorithm needs a list of detected objects
in a temporal window as input. The size of this temporal
window (denoted T_2) is a parameter. The proposed tracker
is composed of three stages. First, the system computes the
link similarity between any two detected objects appearing in
a given temporal window to establish possible links. Second,
the trajectories, each consisting of a set of consecutive links
resulting from the previous stage, are computed so that the system
obtains the highest possible total of global similarities (see
section 4.3). Finally, a filter is applied to remove noisy trajectories.

4.1 Establishment of object links 

For each detected object pair in a given temporal window of
size T_2, the system computes the link similarity (i.e. instantaneous
similarity) defined in formula (10). A temporal link is
established between these two objects when their link similarity
is greater than or equal to Th_1 (presented in equation (11)). At the end
of this stage, we obtain a weighted graph whose vertices are the
detected objects in the considered temporal window and whose
edges are the temporally established links associated with the
object similarities (see figure 1).




Figure 1. The graph representing the established links of the
detected objects in a temporal window of size T_2 frames.
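
The construction of the weighted graph can be sketched as follows. The data layout (a list of `(frame, object_id)` tuples), the function name and the choice not to link two objects detected in the same frame are assumptions made for this example; the similarity function is passed in as a parameter.

```python
def build_link_graph(detections, similarity, th1, window):
    """Return {(a, b): similarity} for every established temporal link.

    detections -- list of (frame, object_id) tuples
    similarity -- callable giving the link similarity of two detections
    th1        -- minimum similarity for a link to be established
    window     -- size of the temporal window, in frames
    """
    edges = {}
    for a in range(len(detections)):
        for b in range(a + 1, len(detections)):
            gap = abs(detections[a][0] - detections[b][0])
            # Only pairs from different frames within the window are linked.
            if 0 < gap <= window:
                score = similarity(detections[a], detections[b])
                if score >= th1:
                    edges[(a, b)] = score
    return edges
```

Because links may span several frames, an object missed by the detector for a few frames can still be connected to its earlier detections, as long as the gap does not exceed the window size.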



4.2 Long term similarity 

In this section, we study the similarity score between an object
o_l detected at t and the trajectory of o_m detected previously,
called long term similarity (to distinguish it from the link similarity
score between two objects). By assuming that the variations
of the 2D area, shape ratio, color histogram, color covariance
and dominant color features of a mobile object follow a Gaussian
distribution, we can use the Gaussian probability density
function (PDF) to compute this score. Also, the longer the trajectory
of o_m is, the more reliable this similarity is. Therefore, for
each feature k among these features, we define a long term similarity
score between object o_l and the trajectory of o_m as follows:

    LT_k(o_l, o_m) = \frac{1}{\sqrt{2 \pi \sigma_m^2}} e^{-\frac{(s_l - \mu_m)^2}{2 \sigma_m^2}} \min(T / Q, 1)    (15)

where s_l is the value of feature k for object l, \mu_m and \sigma_m are
respectively the mean and standard deviation values of feature k of the
last Q objects belonging to the trajectory of o_m (Q is a predefined
parameter), and T is the time length (number of frames) of the o_m
trajectory. Thanks to the selection of the last Q objects, the
long term similarity can take into account the latest variations
of the o_m trajectory.



For the remaining features (2D, 3D displacement distance and
HOG), the long term similarity is set to the same value as the
link similarity.
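
The long term similarity of one feature can be sketched as a Gaussian density weighted by a reliability factor. The example below assumes that `history` holds one feature value per frame of the trajectory, so that its length stands for T; the function name and the `min_sigma` floor (which guards against a zero standard deviation) are assumptions not found in the text.

```python
import math
from statistics import mean, pstdev


def long_term_similarity(value, history, q, min_sigma=1e-6):
    """Gaussian density of `value` under the last q feature values of a
    trajectory, scaled by min(T / q, 1) where T = len(history)."""
    recent = history[-q:]                  # last Q objects of the trajectory
    mu = mean(recent)
    sigma = max(pstdev(recent), min_sigma)
    density = (math.exp(-((value - mu) ** 2) / (2 * sigma ** 2))
               / math.sqrt(2 * math.pi * sigma ** 2))
    reliability = min(len(history) / q, 1.0)
    return density * reliability
```

For a trajectory of only 3 frames and Q = 6, the density is halved, which expresses that a short trajectory gives a less reliable appearance model.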

4.3 Trajectory determination 

The goal of this stage is to determine the trajectories of the
mobile objects. For each detected object o_l at instant t, we
consider all its matched objects o_m (i.e. objects with temporally
established links) in previous frames that do not yet have official
links (i.e. trajectories) to any objects detected at t. For such
an object pair (o_l, o_m), we define a global score GS(o_l, o_m)
as follows:

    GS(o_l, o_m) = \frac{\sum_{k=1}^{8} w_k GS_k(o_l, o_m)}{\sum_{k=1}^{8} w_k}    (16)

where w_k is the weight of feature k (resulting from the learning
phase, see section 3) and GS_k(o_l, o_m) is the global score of feature
k between o_l and o_m, defined as a function of the link similarity
and long term similarity of feature k:

    GS_k(o_l, o_m) = (1 - \beta) LS_k(o_l, o_m) + \beta LT_k(o_l, o_m)    (17)

where LS_k(o_l, o_m) is the link similarity of feature k between
the two objects o_l and o_m, LT_k(o_l, o_m) is their long term similarity
defined in section 4.2, and \beta is the weight of the long term
similarity, defined as follows:

    \beta = \min(T / Q, Th_4)    (18)

where T and Q are presented in section 4.2 and Th_4 is the maximum
expected weight for the long term similarity.

The object o_m having the highest global similarity is considered
as the temporal father of object o_l. After considering all
objects at instant t, if more than one object gets o_m as a father,
the pair (o_l, o_m) whose GS(o_l, o_m) value is the highest is
kept and the link between this pair becomes official (i.e. it becomes
officially a trajectory segment). An object is no longer tracked
if it cannot establish any official link in T_2 consecutive frames.
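
The scoring and the father selection can be sketched as follows. This is a simplified illustration: the function names and the dictionary layout are hypothetical, the global score is assumed to be normalized by the sum of the weights, and an object that loses a conflict is simply left without an official link in this sketch.

```python
def long_term_weight(t, q, th4):
    """Weight of the long term similarity: min(T / Q, Th_4)."""
    return min(t / q, th4)


def global_score(ls, lt, weights, beta):
    """Weighted mean over features of (1 - beta) * LS_k + beta * LT_k."""
    total = sum(weights)
    blended = ((1 - beta) * a + beta * b for a, b in zip(ls, lt))
    return sum(w * g for w, g in zip(weights, blended)) / total


def assign_fathers(scores):
    """scores maps (child, father) to a global score; returns {child: father}."""
    best = {}                              # best father of each child
    for (child, father), gs in scores.items():
        if child not in best or gs > best[child][1]:
            best[child] = (father, gs)
    kept = {}                              # each father keeps its best child
    for child, (father, gs) in best.items():
        if father not in kept or gs > kept[father][1]:
            kept[father] = (child, gs)
    return {child: father for father, (child, _) in kept.items()}
```

If two objects detected at t both prefer the same previous object, only the pair with the highest global score becomes an official trajectory segment.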

4.4 Trajectory filtering 

Noise usually appears when a wrong detection or misclassification
(e.g. due to low image quality) occurs. Hence a static
object (e.g. a chair, a machine) or some image regions (e.g.
window shadow, merged objects) can be detected as a mobile
object. However, such noise usually appears in only a few frames
or has no real motion. We thus use temporal and spatial filters
to remove potential noise. A trajectory is considered as
noise if one of the two following conditions is satisfied:

    T < Th_5
    d_{max} < Th_6

where T is the time length of the considered trajectory, d_max is
the maximum spatial length of this trajectory, and Th_5, Th_6 are
predefined thresholds.
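
The two filtering conditions translate directly into code. In this sketch a trajectory is represented by a hypothetical `(length_in_frames, max_spatial_length)` tuple, and the function names are assumptions.

```python
def is_noise(length_frames, max_spatial_length, th5, th6):
    """A trajectory is noise if it is too short in time or in space."""
    return length_frames < th5 or max_spatial_length < th6


def filter_trajectories(trajectories, th5, th6):
    """Keep only the trajectories that pass both the temporal and the
    spatial filter. Each trajectory is (length_frames, max_spatial_length)."""
    return [t for t in trajectories if not is_noise(t[0], t[1], th5, th6)]
```

A long-lived but static detection (for instance a chair) fails the spatial test, while a brief false detection fails the temporal test.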



5 Experimentation and Validation 

The objective of this experimentation is to show the effect of
feature weight learning, and to compare the performance of the
proposed tracker with other state-of-the-art trackers. To
this end, in the first part, we test the proposed tracker on two
complex videos (many moving people, high occlusion occurrence
frequency) which are respectively provided by the Caretaker
European project^1 and the TRECVid dataset [?]. These
two videos are tested in both cases: without and with the feature
weight learning. In the second part, five videos belonging
to two public datasets, ETISEO^2 and Caviar^3, are tested,
and the tracking result (with the feature learning) is compared
with some other state-of-the-art approaches.

In order to evaluate the tracking performance, we use the
three tracking evaluation metrics defined in the ETISEO project
[?]. The first tracking evaluation metric M_1 measures the percentage
of time during which a reference object (ground truth
data) is correctly tracked. The second metric M_2 computes
throughout time how many tracked objects are associated with
one reference object. The third metric M_3 computes the number
of reference object IDs per tracked object. These metrics
must be used together to obtain a complete performance evaluation.
Therefore, we also define a tracking metric M taking the
average value of these three tracking metrics. The four metric
values are defined in the interval [0, 1]. The higher the metric
value is, the better the tracking algorithm performs.

In this experimentation, we use the people detection algo- 
rithm based on the HOG descriptor of the OpenCV library. So 
we focus the experimentation on the sequences containing peo- 
ple movements. However the principle of the proposed track- 
ing algorithm is not dependent on the tracked object type. For 
learning feature weights, we use video sequences that are dif- 
ferent from the tested videos but which have a similar context. 

The first tested video (provided by the Caretaker project)
depicts people moving in a subway station. The frame rate
of this sequence is 5 fps (frames / second) and the length is
5 min (see figure 2a). We have learnt feature weights on a
sequence of 2000 frames. The learning algorithm selects w_5 =
0.5 (color histogram feature) and w_6 = 0.5 (HOG feature).

The second tested sequence (belonging to the TRECVid
dataset) depicts the movements of people in an airport (see
figure 2b). It contains 5000 frames and lasts 3 min 20 sec. We
have learnt feature weights on a sequence of 5000 frames. The
learning algorithm selects w_1 = 0.24 (3D displacement
distance), w_4 = 1 (2D area) and w_5 = 0.76 (color histogram).

Table 1 presents the tracking results in two cases: without
and with feature weight learning. With the
proposed learning scheme, the tracker performance increases
for both tested videos. The processing time of the tracker
also decreases significantly because many features are not used.

The two following tested videos belong to the ETISEO dataset.
The first tested ETISEO video shows a building entrance, denoted
ETI-VS1-BE-18-C4. It contains 1108 frames and the frame
rate is 25 fps. In this sequence, there is only one person moving
(see figure 2c). We have learnt feature weights on a sequence
of 950 frames. The learning algorithm has selected
the 3D displacement distance feature as the unique feature for
tracking in this context. The result of the learning phase is
reasonable since there is only one moving person.

1 http://cordis.europa.eu/ist/kct/caretaker_synopsis.htm
2 http://www-sop.inria.fr/orion/ETISEO/
3 http://homepages.inf.ed.ac.uk/rbf/CAVIARDATA1/

The second tested ETISEO video, denoted ETI-VS1-MO-7-C1, shows
an underground station with occlusions. The difficulty
of this sequence lies in the low contrast and bad
illumination. The scene depth is quite large (see figure
2d). This video sequence contains 2282 frames and the frame rate
is 25 fps. We have learnt feature weights on a sequence of 500
frames. The color covariance feature is selected as the unique
feature for tracking in this context. It is a good solution because
the dominant color and HOG features do not seem to be effective
due to bad illumination. Also, the size and displacement
distance features are not reliable because their measurements
do not seem to be discriminative for people moving far away
from the camera.

In these two experiments, tracker results from seven different
teams (denoted by numbers) in ETISEO are presented:
1, 8, 11, 12, 17, 22, 23. Because the names of these teams
are hidden, we cannot determine their tracking approaches.
Table 2 presents the performance results of the considered trackers.
The tracking evaluation metrics of the proposed tracker get the
highest values in most cases compared to other teams.

The last three tested videos belong to the Caviar dataset
(see figure 2e). In this dataset, we have selected the same
sequences tested in [?] to allow a direct comparison:
OneStopEnter2cor, OneStopMoveNoEnter1cor and
OneStopMoveNoEnter2cor. In these three sequences, there are 9
persons walking in a corridor. The proposed approach can track
all of them. However, there are three noisy trajectories in the
last sequence because wrong detections occur over a long
period. Table 3 presents the result summary for these videos.
TP (True Positive) refers to the number of correctly tracked
trajectories. FN (False Negative) is the number of lost trajectories.
FP (False Positive) represents the number of noisy trajectories.
Compared to [?], our proposed tracker has better values for all
three of these indexes.

6 Conclusion and Future work 

We have presented in this paper an approach which combines a
large set of appearance features and learns tracking parameters.
The quantification of HOG descriptor reliability and the combination
of color covariance and dominant color with spatial pyramid
distance help to increase the robustness of the tracker for
managing occlusion cases. The learning of feature significances
for different video contexts also helps the tracking algorithm to
adapt itself to the context variation problem. The experimentation
shows the effect of the feature weight learning, and the
robustness of the proposed tracker compared to some other
state-of-the-art approaches. We propose in future work an
automatic context detection to increase the auto-control capacity
of the system and to remove the two assumptions given in
this paper (presented in section 1).

Table 1. Summary of tracking results in both cases: without
and with feature weight learning.

Figure 2. Illustration of five tested videos: a. Caretaker b. Trecvid c. ETI-VS1-BE-18-C4 d. ETI-VS1-MO-7-C1 e. Caviar

Table 2. Summary of tracking results for two ETISEO videos.
BE denotes ETI-VS1-BE-18-C4 sequence, MO denotes
ETI-VS1-MO-7-C1 sequence. The highest values are printed bold.

Table 3. Summary of tracking results for three Caviar videos